
Improve megatron dataset preprocessing script and update docs #918

Open

kevalmorabia97 wants to merge 1 commit into main from kmorabia/megatron-data-preprocess-cleanup

Conversation


@kevalmorabia97 kevalmorabia97 commented Feb 23, 2026

What does this PR do?

Improve megatron dataset preprocessing script and update docs

Usage

python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
    --hf_name Nemotron-SFT-General \
    --hf_split train \
    --hf_max_samples_per_split 10_000_000 \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000

python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000

Testing

  • Downloaded and tokenized Nemotron-Post-Training-Dataset-v2

Summary by CodeRabbit

  • Documentation

    • Updated data preparation guides with new CLI patterns and Hugging Face Hub integration instructions.
  • New Features

    • Added batch tokenization via directory input and direct Hugging Face dataset downloads with flexible subset/split filtering.
  • Configuration Updates

    • Optimized distillation settings: adjusted optimizer parameters and increased checkpoint retention.

@kevalmorabia97 kevalmorabia97 requested review from a team as code owners February 23, 2026 09:01

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

The PR refactors the Megatron preprocessing API from a single input_path parameter to a multi-source interface supporting input_dir, jsonl_paths, or dataset sources with HuggingFace dataset download integration. Documentation, tests, and dependent code are updated accordingly. Minor configuration adjustments to hyperparameters and logging messages are made.

Changes

Cohort / File(s) Summary
Documentation & Examples
examples/megatron_bridge/README.md
Updated data preparation instructions to show CLI invocations instead of Python function calls; documented new tokenization options (input_dir, dataset, subset, split) and revised example output paths.
Core API Refactoring
modelopt/torch/utils/plugins/megatron_preprocess_data.py
Replaced single-path input model with multi-source API (mutually exclusive: input_dir, jsonl_paths, or dataset); added _download_hf_dataset helper for HuggingFace dataset fetching with subset/split filtering; updated error handling and added force_print option in statistics logging.
Dependent Code & Configuration Updates
examples/nemo_run/common/process_climbmix.py, examples/megatron_bridge/distill.py
Updated process_climbmix.py to use new jsonl_paths parameter; adjusted distill.py hyperparameters (adam_beta2 0.98→0.95, most_recent_k 3→5) and enhanced end-of-distillation logging messages.
Test Infrastructure Updates
tests/unit/torch/utils/test_megatron_preprocess_data.py, tests/_test_utils/torch/megatron/utils.py
Updated test skip conditions to skip_if_no_megatron(te_required=False); renamed minipile test to reflect JSONL input; refactored assertions to validate .bin/.idx file presence instead of exact filenames; added new test for HuggingFace dataset input.
Minor Updates
modelopt/torch/prune/plugins/mcore_minitron.py
Added no-op comment placeholder before hybrid model pattern override during pruning search.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 63.64%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title accurately summarizes the main changes: improvements to the megatron dataset preprocessing script and documentation updates across multiple files.
  • Description check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.




codecov bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.10%. Comparing base (02fa362) to head (badcfd1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #918      +/-   ##
==========================================
- Coverage   73.11%   73.10%   -0.02%     
==========================================
  Files         205      205              
  Lines       22281    22281              
==========================================
- Hits        16291    16288       -3     
- Misses       5990     5993       +3     


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

172-179: ⚠️ Potential issue | 🟠 Major

UnboundLocalError on i when processing an empty JSONL file.

i is assigned only inside the enumerate loop at line 172. If encoded_docs yields no items (empty file), i is never defined, and the new force_print call on line 179 raises UnboundLocalError.

🐛 Proposed fix
-    total_doc_len, total_enc_len, final_enc_len = 0, 0, 0
+    total_doc_len, total_enc_len, final_enc_len, i = 0, 0, 0, 0
     for i, (doc, sentence_lens, (doc_len, enc_len)) in enumerate(encoded_docs, start=1):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` around lines 172 -
179, The loop variable i is undefined when encoded_docs is empty, causing
UnboundLocalError at the final self._print_processing_stats(i, ...) call; fix by
initializing i (e.g., i = 0) before the for i, (doc, sentence_lens, (doc_len,
enc_len)) in enumerate(...) loop or by using a safe fallback when calling
self._print_processing_stats (e.g., pass i if defined else 0); update the code
around the enumerate block in megatron_preprocess_data.py so that i is always
defined before the final force_print call.
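The failure mode is easy to reproduce in isolation; `process` below is a hypothetical stand-in for the real preprocessing loop, showing why the initialization matters:

```python
def process(encoded_docs):
    """Sketch of the bug: `i` only exists if the loop body runs at least once."""
    i = 0  # the fix: initialize before the loop so an empty input is safe
    for i, doc in enumerate(encoded_docs, start=1):
        pass  # ... tokenize and write the document ...
    # Without the initialization above, an empty iterable would leave `i`
    # undefined here and this line would raise UnboundLocalError.
    return i

print(process([]))          # → 0 with the fix; UnboundLocalError without it
print(process(["a", "b"]))  # → 2
```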
🧹 Nitpick comments (2)
modelopt/torch/prune/plugins/mcore_minitron.py (1)

320-320: LGTM — consider opening a tracking issue for this rename.

The TODO comment clearly documents the pending rename of hybrid_override_pattern → hybrid_layer_pattern once Megatron-LM#3377 merges. No logic or behavior change.

Would you like me to draft a tracking issue for this rename so it doesn't get lost?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/prune/plugins/mcore_minitron.py` at line 320, The comment TODO
notes that the name hybrid_override_pattern should be renamed to
hybrid_layer_pattern after Megatron-LM#3377; create a tracking issue in your
repo (or project board) documenting this rename, include the file and symbol
names (hybrid_override_pattern in
modelopt/torch/prune/plugins/mcore_minitron.py), mention the PR that triggers it
(NVIDIA/Megatron-LM#3377), and list required follow-ups (update the TODO
comment, rename the symbol across codebase, adjust any docs/tests that reference
hybrid_override_pattern) so the change won't be lost.
modelopt/torch/utils/plugins/megatron_preprocess_data.py (1)

271-271: Mutable default argument for json_keys.

json_keys: list[str] = ["text"] shares a single list object across all callers that omit the argument. While the list is currently only read (not mutated) inside the function, this is a Python antipattern that could cause subtle bugs if the implementation changes.

♻️ Proposed fix
-    json_keys: list[str] = ["text"],
+    json_keys: list[str] | None = None,

Then at the top of the function body:

    if json_keys is None:
        json_keys = ["text"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py` at line 271, The
parameter json_keys is defined with a mutable default list (`json_keys:
list[str] = ["text"]`), so change the function signature to use None as the
default (e.g., `json_keys: list[str] | None = None`) and add a guard at the top
of the function (in modelopt/torch/utils/plugins/megatron_preprocess_data.py)
that sets `json_keys = ["text"]` when `json_keys is None`; reference the
parameter name `json_keys` to locate and update the code and ensure any type
hints are adjusted accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/megatron_bridge/README.md`:
- Around line 103-111: The bash example uses inline comments after a
backslash-continued line (e.g., the lines containing `--json_keys text \  #
change to your JSON key if needed` and `--tokenizer Qwen/Qwen3-0.6B \  # fyi,
all models...`), which breaks line continuation; remove those trailing inline
comments or move them to their own lines above the command, or place comments
only on the final non-continued line, and ensure every `\` is the last character
on the line so the `python -m
modelopt.torch.utils.plugins.megatron_preprocess_data` command and flags
(`--jsonl_paths`, `--json_keys`, `--tokenizer`, `--output_dir`, `--workers`,
`--max_sequence_length`) are valid when copy-pasted.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 301-315: Move the creation of the output directory to before
calling _download_hf_dataset so the downloader can write files; specifically,
call Path(output_dir).mkdir(parents=True, exist_ok=True) prior to invoking
_download_hf_dataset(dataset, output_dir, json_keys, subset=subset, split=split)
in the code block that contains the dataset/download logic, leaving the rest of
the file discovery logic for input_dir/jsonl_paths unchanged.
- Around line 229-255: _download_hf_dataset currently builds jsonl_file_path
using ds_path which may contain slashes and never ensures parent directories
exist, causing FileNotFoundError when ds.to_json is called; fix by sanitizing or
using only the dataset name (not the HF org prefix) when composing
jsonl_file_path and ensure the parent directory is created before writing:
derive a safe filename from ds_path (or replace '/' with '_' or use
Path(ds_path).name), compute parent = Path(output_dir)/"raw"/<safe_name> and
call parent.mkdir(parents=True, exist_ok=True) before checking/creating
jsonl_file_path and before calling ds.to_json; also ensure output_dir is created
earlier (or rely on the same mkdir(parents=True)) so writes never fail.

In `@tests/_test_utils/torch/megatron/utils.py`:
- Line 23: The shared utils module uses skip_if_no_megatron(te_required=False)
which allows import to run TE-dependent megatron.core model imports (e.g.,
GPTModel, MambaModel, GPTInferenceWrapper) and causes ImportError when TE is
missing; update the guard so the module is only imported when TE is available by
changing the skip invocation to skip_if_no_megatron(te_required=True) (or, if
utility functions truly don't need TE, move the skip_if_no_megatron call to
after the TE-free helper definitions and keep TE imports behind a separate
guarded block); ensure initialize_for_megatron and any TE-dependent symbols are
only loaded under the TE-required guard to keep imports consistent with
tests/models.py.

In `@tests/unit/torch/utils/test_megatron_preprocess_data.py`:
- Around line 98-121: The test test_megatron_preprocess_data_with_hf_dataset
makes live network calls and also writes JSON into a non-existent raw/
subdirectory in _download_hf_dataset; update the test to skip in offline CI by
adding a guard (e.g., pytest.mark.skipif or an env-check like
SKIP_NETWORK_TESTS) before running megatron_preprocess_data, and in
_download_hf_dataset ensure the target directory (the constructed raw/ path) is
created (os.makedirs(..., exist_ok=True)) before calling ds.to_json(); reference
the test function name test_megatron_preprocess_data_with_hf_dataset, the helper
_download_hf_dataset, and the call sites around load_dataset and ds.to_json when
implementing these changes.
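One way to express such a guard in pytest (the env-var name `SKIP_NETWORK_TESTS` is the review's suggestion, not an existing convention in this repo):

```python
import os

import pytest

# Reusable marker: offline CI sets SKIP_NETWORK_TESTS=1 to skip any test
# that downloads from the Hugging Face Hub.
requires_network = pytest.mark.skipif(
    os.environ.get("SKIP_NETWORK_TESTS") == "1",
    reason="network access disabled in this CI environment",
)


@requires_network
def test_megatron_preprocess_data_with_hf_dataset(tmp_path):
    ...  # calls load_dataset(...) and megatron_preprocess_data(...) as in the PR
```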

---

Outside diff comments:
In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Around line 172-179: The loop variable i is undefined when encoded_docs is
empty, causing UnboundLocalError at the final self._print_processing_stats(i,
...) call; fix by initializing i (e.g., i = 0) before the for i, (doc,
sentence_lens, (doc_len, enc_len)) in enumerate(...) loop or by using a safe
fallback when calling self._print_processing_stats (e.g., pass i if defined else
0); update the code around the enumerate block in megatron_preprocess_data.py so
that i is always defined before the final force_print call.

---

Nitpick comments:
In `@modelopt/torch/prune/plugins/mcore_minitron.py`:
- Line 320: The comment TODO notes that the name hybrid_override_pattern should
be renamed to hybrid_layer_pattern after Megatron-LM#3377; create a tracking
issue in your repo (or project board) documenting this rename, include the file
and symbol names (hybrid_override_pattern in
modelopt/torch/prune/plugins/mcore_minitron.py), mention the PR that triggers it
(NVIDIA/Megatron-LM#3377), and list required follow-ups (update the TODO
comment, rename the symbol across codebase, adjust any docs/tests that reference
hybrid_override_pattern) so the change won't be lost.

In `@modelopt/torch/utils/plugins/megatron_preprocess_data.py`:
- Line 271: The parameter json_keys is defined with a mutable default list
(`json_keys: list[str] = ["text"]`), so change the function signature to use
None as the default (e.g., `json_keys: list[str] | None = None`) and add a guard
at the top of the function (in
modelopt/torch/utils/plugins/megatron_preprocess_data.py) that sets `json_keys =
["text"]` when `json_keys is None`; reference the parameter name `json_keys` to
locate and update the code and ensure any type hints are adjusted accordingly.
ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02fa362 and 7ca0641.

📒 Files selected for processing (7)
  • examples/megatron_bridge/README.md
  • examples/megatron_bridge/distill.py
  • examples/nemo_run/common/process_climbmix.py
  • modelopt/torch/prune/plugins/mcore_minitron.py
  • modelopt/torch/utils/plugins/megatron_preprocess_data.py
  • tests/_test_utils/torch/megatron/utils.py
  • tests/unit/torch/utils/test_megatron_preprocess_data.py

@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-data-preprocess-cleanup branch 6 times, most recently from fe545ef to badcfd1 Compare February 23, 2026 16:25
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 force-pushed the kmorabia/megatron-data-preprocess-cleanup branch from badcfd1 to b805c01 Compare February 23, 2026 17:46
@kevalmorabia97 kevalmorabia97 changed the title Cleanup megatron dataset preprocessing script and update docs Improve megatron dataset preprocessing script and update docs Feb 23, 2026